Improve CBO estimates for correlated columns #11324
TPC benchmark results with SF1000 ORC
Prefix already approved in #11066
d85c0a2 to 0476291
613370d to eb33a4a
Is it about correlation in the sense of query outer scope references, or ... ?
This is about correlation in the data between columns (e.g. nation, city).
Results with latest PR
sopel39 left a comment
Small comments. Did the tests change?
Currently we assume that there is no correlation between the terms of a filter conjunction. This can result in underestimation, as there is often some correlation between columns in real data sets. In particular, predicates inferred on the build side relation through a join with a partitioned table are often correlated with user-provided predicates on the build side. Estimation for filter conjunctions now applies an exponential decay to the selectivity of each successive term to reduce the chance of underestimation. optimizer.filter-conjunction-independence-factor is added to allow tuning the strength of the independence assumption.
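As a rough illustration of the exponential decay described above, here is a minimal sketch and not the code in this PR. It assumes that terms are ordered from most to least selective, that the i-th term's selectivity is raised to the power factor^i, and that the factor lies in [0, 1] (1 reproduces the plain product, 0 keeps only the most selective term). The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: combine per-term selectivities of a filter conjunction
// with exponential decay. Not the actual implementation from this PR.
final class ConjunctionSelectivitySketch
{
    private ConjunctionSelectivitySketch() {}

    // independenceFactor is assumed to be in [0, 1]:
    //   1 -> terms treated as fully independent (plain product)
    //   0 -> terms treated as fully correlated (most selective term only)
    static double combine(double[] termSelectivities, double independenceFactor)
    {
        if (independenceFactor < 0 || independenceFactor > 1) {
            throw new IllegalArgumentException("independenceFactor must be in [0, 1]");
        }
        double[] sorted = termSelectivities.clone();
        Arrays.sort(sorted); // ascending: most selective (smallest) term first

        double result = 1.0;
        double exponent = 1.0; // factor^0 for the first term
        for (double selectivity : sorted) {
            result *= Math.pow(selectivity, exponent);
            exponent *= independenceFactor; // each successive term counts less
        }
        return result;
    }

    public static void main(String[] args)
    {
        double[] terms = {0.5, 0.1, 0.2};
        System.out.println("independent:  " + combine(terms, 1.0));
        System.out.println("correlated:   " + combine(terms, 0.0));
        System.out.println("intermediate: " + combine(terms, 0.75));
    }
}
```

With a factor strictly between 0 and 1, the estimate lands between the full product and the selectivity of the most selective term alone, which is what reduces the risk of underestimation.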
Currently we assume perfect correlation between the clauses of a join and use the most selective clause to drive output row count estimation. This can result in overestimation, as the columns in join keys are not necessarily perfectly correlated in real data sets. Estimation for multi-clause joins now applies an exponential decay to the selectivity of each successive term to reduce the chance of overestimation. optimizer.join-multi-clause-independence-factor is added to allow tuning the strength of the independence assumption.
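The effect on join output estimates can be sketched in the same way. This is an illustrative example under stated assumptions, not the PR's code: per-clause selectivities are taken as given inputs, the output row count is modeled as leftRows * rightRows * combined selectivity, and the decay uses the same factor^i exponent scheme with the most selective clause first. All names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: estimate output rows of a multi-clause equi-join
// by decaying the contribution of each successive clause.
final class MultiClauseJoinSketch
{
    private MultiClauseJoinSketch() {}

    // independenceFactor is assumed to be in [0, 1]:
    //   0 -> only the most selective clause drives the estimate (old behavior)
    //   1 -> clauses treated as fully independent
    static double estimateOutputRows(
            double leftRows,
            double rightRows,
            double[] clauseSelectivities,
            double independenceFactor)
    {
        if (independenceFactor < 0 || independenceFactor > 1) {
            throw new IllegalArgumentException("independenceFactor must be in [0, 1]");
        }
        double[] sorted = clauseSelectivities.clone();
        Arrays.sort(sorted); // most selective clause first

        double combined = 1.0;
        double exponent = 1.0;
        for (double selectivity : sorted) {
            combined *= Math.pow(selectivity, exponent);
            exponent *= independenceFactor;
        }
        return leftRows * rightRows * combined;
    }

    public static void main(String[] args)
    {
        double[] clauses = {0.01, 0.1}; // assumed per-clause selectivities
        for (double factor : new double[] {0.0, 0.25, 1.0}) {
            System.out.println("factor=" + factor + " rows="
                    + estimateOutputRows(1000, 1000, clauses, factor));
        }
    }
}
```

A factor of 0 reproduces the previous behavior of letting the most selective clause alone drive the estimate, while larger factors let the additional clauses reduce the estimated row count further.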
eb33a4a to 44c51cb
List<Symbol> expressionSymbols = expressionUniqueSymbols.get(term);
int expressionPartitionId;
if (expressionSymbols.isEmpty()) {
    expressionPartitionId = symbolPartitions.size(); // For expressions with no symbols
Description
The overall goal of this PR is to work towards enabling optimizer.default-filter-factor-enabled by default.
Enabling default-filter-factor with the existing implementation significantly improves q18 and q21 on TPC-H.
However, it also causes regressions on certain benchmark queries (TPC-DS partitioned q64, TPC-DS unpartitioned q78).
These changes update the estimation logic of filters and joins to address the underestimation of filter conjunctions
and the overestimation of multi-clause joins observed when default-filter-factor is enabled
with the existing implementation.
Improvement
Query optimizer
Improves CBO estimates in the presence of hard-to-estimate terms.
Related issues, pull requests, and links
Picks first (n-1) commits from #11066
Documentation
( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text: